Abstract. The provision of mechanisms for processor allocation in current distributed
parallel programming models is very limited. This makes it difficult, or even impossible,
to express a large class of programs which require a run-time assessment of their
required resources. This includes programs whose structure is irregular, composite or
unbounded. Efficient allocation of processors requires a process creation mechanism
able to initiate and terminate remote computations quickly. This paper presents the
design, demonstration and analysis of an explicit mechanism to do this, implemented on
the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that
process creation can be made efficient so that it incurs only a fractional overhead of
the total runtime and that it can be combined naturally with recursion to enable rapid
distribution of computations over a system.

Introduction

An essential issue in the design of scalable, distributed parallel computers is the rate at which
computations can be initiated, and results collected as they terminate [1]. This requires an
efficient method of process creation capable of dispatching a program, and the data on which
it operates, to a remote processor. This paper presents the design, implementation,
demonstration and evaluation of a process creation mechanism for the XMOS XS1 architecture [2].

Parallelism is being employed on an increasingly large scale to improve performance
of computer systems, particularly in high performance systems, but increasingly in other
areas such as embedded computing [3]. As current programming models such as MPI (Message
Passing Interface) provide limited support for automated management of processing
resources, the burden of doing this mainly falls on the programmer. These issues are not
relevant to the expression of a program as, in general, a programmer is concerned only with
introducing parallelism (execution on multiple processors) to improve performance, and not
with how the computation is scheduled on the underlying system. When we consider that future
high performance systems will run on the order of 10^ threads [4], it is clear that the
programming model must provide some means of dynamic processor allocation to remove this
burden. This is the situation we have with memory in sequential systems, where allocation
and deallocation are performed with varying degrees of automation.

This observation is not new [5,6], but it is only as existing programming models and
software struggle to meet the increasing scale of parallelism that the problem is again coming
to light. For instance, capabilities for process creation and management were introduced in
the MPI-2.0 specification, stating that: "Reasons for including process management in MPI
are both technical and practical. Important classes of message-passing applications require
this control. These include task farms, serial applications with parallel modules and
problems that require a run-time assessment of the number and type of processes that should be
started" [7]. Several MPI implementations support process creation and management
functionality, but it is pitched as an 'advanced' feature that is difficult to use and problematic
with many current job-scheduling systems. More encouragingly, language-level abstractions
for dynamic process creation and placement have appeared recently in the Chapel [8] and
X10 [9] languages, which are being developed by Cray and IBM respectively as part of DARPA's
High Productivity Computing Systems program. Both support these concepts as key ingredients in
the design of parallel programs, but they are built on software communication libraries and
statically-mapped program binaries. Consequently, they are subject to the same
communication inefficiencies and inflexibility of single-program approaches.

A run-time assessment of required processing resources concerns a large class of programs
whose structure is irregular, such as unstructured-grid algorithms like the Spectral Element
Method [10]; unbounded, such as recursively-structured algorithms like Branch-and-Bound
search [11] and Adaptive Mesh Refinement [12]; or composite, where a program may be
composed of different parallel subroutines that are themselves executed in parallel, possibly
each with its own structure. These all require a means of dynamic processor allocation that
is able to distribute computations over a set of processors, depending on requirements
determined at runtime. The combination of parallelism and recursion is a powerful mechanism
for growth which can be used to implement distribution efficiently. This must be supported
with a mechanism for process creation with the ability to dispatch, initiate and terminate
computations efficiently on remote processors.

This paper presents the design and implementation of an explicit scheme for dynamic
process creation in a distributed memory parallel computer. This work is intended to be a key
building block for a more automatic scheme. The implementation is on the XMOS XS1
architecture, which has low-level provisions for concurrency, allowing a convincing
proof-of-concept implementation. Based on this, the process creation mechanism is evaluated by
combining it with controlled recursion in two simple algorithms to demonstrate the rate and
granularity at which it is possible to create remote computations. Performance models are
developed in each case to interpret the measured results and to make predictions for larger
systems and workloads. This analysis highlights the efficiency, scalability and effectiveness
of the concept and approach taken.

The rest of this paper is structured as follows. Section 1 describes the XS1 architecture,
the experimental platform and the notations and conventions used. Section 2 gives a
brief overview of the design and implementation details. Section 3 presents the performance
models and experimental and predicted results. Finally, Section 4 concludes and Section 5
discusses possible future extensions to the work.

1. Background 

1.1. Platform 

The XMOS XS1 processor architecture [2] is general-purpose, multi-threaded, scalable and
has been designed from the ground up to support concurrency. It allows systems to be
constructed from multiple XCore processors which communicate with each other through fast
communication links. The key novel aspect of this architecture with respect to the work in
this paper is the instruction set support for processes and communication. Low-level
threading and communication are key features, exposed with operations, for example, to provide
synchronous and asynchronous fork-join thread-level parallelism and channel-based message
passing communication. Provision of these features in hardware allows them to be performed
in the same order of magnitude of time as memory references, branches and arithmetic. This
allows efficient high-level notations for concurrency to be built.

The system used to demonstrate and evaluate the proposed process creation mechanism
is an experimental board called the XK-XMP-64 [13]. It connects together 64 XCore
processors in 16 XS1-G4 devices which run at 400MHz. The G4 devices are interconnected
in a 4-dimensional hypercube which equivalently can be viewed as a 2-dimensional torus.
Mathematically, this is defined in the following way [14]:

Definition 1. A $d$-dimensional hypercube is a graph $G = (N, E)$ where $N$ is the set of $2^d$
nodes and $E$ is the set of edges. Each node is labelled with a $d$-bit identifier. For any
$m, n \in N$, an edge exists between $m$ and $n$ if and only if

    $m \oplus n = 2^k$

for $0 \le k < d$, where $\oplus$ is the bitwise exclusive-or operator. Hence, each node has
$d = \log |N|$ edges and $|E| = d\,2^{d-1}$.
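
As a minimal sketch of Definition 1 (not code from the paper), the adjacency rule can be written with a bitwise exclusive-or. The function names `neighbours` and `edges` are hypothetical and chosen only for illustration.

```python
def neighbours(node: int, d: int) -> list[int]:
    """Nodes adjacent to `node` in a d-dimensional hypercube: flip one of the d bits."""
    return [node ^ (1 << k) for k in range(d)]


def edges(d: int) -> set[tuple[int, int]]:
    """All undirected edges, each stored once as (smaller, larger)."""
    return {(m, n) for m in range(1 << d) for n in neighbours(m, d) if m < n}


if __name__ == "__main__":
    d = 4  # the 16 XS1-G4 devices form a 4-dimensional hypercube
    print(neighbours(0b0101, d))
    print(len(edges(d)), "edges; the formula d * 2**(d-1) gives", d * 2 ** (d - 1))
```

Counting each edge once reproduces the edge count stated in the definition, which is a quick check that the adjacency rule has been read correctly.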

Each core in the G4 package has a private 64kB memory and is interconnected via
internal links to an integrated switch. It is convenient to view the whole system as a
6-dimensional hypercube. As each core can run 8 hardware threads, the system is capable of
512-way concurrency with an aggregate 25.6 GIPS performance.

1.2. Notation 

For presentation of the algorithms in this paper, a simple imperative, block-structured
notation is used. The following points describe the non-standard elements that appear in the
examples.

1.2.1. Sequential and Parallel Composition

A set of instructions that are to be executed in sequence are composed with the ';' separator.
A sequence of instructions comprises a process. For example, the block

    { I1 ; I2 ; I3 }

defines a simple process to perform three instructions, I1, I2 and I3, in sequence. Processes
may be executed in parallel by composition within a block with the '|' separator. Execution
of a parallel block initiates the execution of the constituent processes simultaneously. The
parallel block successfully terminates only when all processes have successfully terminated.
This is referred to as synchronous fork-join parallelism. For example, the block declaration

    { P1 | P2 | P3 }

denotes the parallel execution of three processes P1, P2 and P3.

1.2.2. Aliasing

The aliases statement is used to create new references to sub-sections of an array. For
example, the statement

    A aliases B[i...j]

sets A to refer to the sub-section of B in the index range i to j.

1.2.3. Process Creation 

The on statement reveals explicitly to the programmer the process creation mechanism. The
statement

    on p do P

is semantically equivalent to executing a call to P, except that process P is transmitted to
processor p, which then executes P and communicates back any results using channels, leaving
the original processor free to perform other tasks. By composing on in parallel, we can exploit
multi-threaded parallelism to offload work while executing another process. For example, the
statement

    { P1 | on p do P2 }

causes P1 to be executed while P2 is offloaded and executed on processor p.
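
The fork-join behaviour of a parallel block containing an on statement can be imitated on a single machine. The following is only a sketch under loud assumptions: threads stand in for remote processors, the processor identifier is ignored, and the helper names `par` and `on` are hypothetical rather than part of the paper's runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def par(*processes: Callable[[], Any]) -> list[Any]:
    """Synchronous fork-join: start every process, return once all have terminated."""
    with ThreadPoolExecutor(max_workers=len(processes)) as pool:
        futures = [pool.submit(process) for process in processes]
        return [future.result() for future in futures]


def on(processor: int, process: Callable[[], Any]) -> Callable[[], Any]:
    """Stand-in for `on p do P`. Assumption: the processor id is not used here."""
    def remote() -> Any:
        return process()
    return remote


if __name__ == "__main__":
    # Corresponds to the block { P1 | on 1 do P2 }
    print(par(lambda: "P1 ran locally", on(1, lambda: "P2 ran on the stand-in host")))
```

The caller of `par` resumes only after both branches have finished, which is the synchronous behaviour the notation describes.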

1.3. Measurements 

All timing measurements presented were made with hardware timers, which are accessible
through the ISA and have 10ns resolution. Constant values were obtained from these
measurements by fitting the performance models to the data.

1.4. Conventions 

All logarithms are to the base 2. p is defined as the number of processors and is taken to be a 
positive power of two. A word is taken to be 4 bytes and is a unit of input in the performance 
models. 

2. Implementation 

The on statement causes the closure of a process P located at a guest processor to be sent to
a remote host processor, the host to execute P, and the host to send back any updated free
variables of P stored at the guest. The execution of on is synchronous in this respect. The
closure of a process P is a complete description of P allowing it to be executed
independently and is defined in the following way:

Definition 2. The closure $C$ of a process $P$ consists of three elements: a set of arguments
$A$, which represents the complete variable context of $P$ as we do not consider global
variables, a set of procedure indices $I$ and a set of procedures $Q$:

    $C(P) = (A, I, Q)$

where $|A| \ge 0$ and $|I| = |Q| \ge 1$. Each argument $a \in A$ is an ordered sequence of one
or more integer values. Each procedure $P \in Q$ is an ordered sequence of one or more
instructions. $I_P$ is an integer value denoting the index of procedure $P$.

Each core maintains a fixed-size jump table denoted 'jump', which records the location
of each procedure in memory. Although procedure addresses may not be consistent between
cores, the indices are guaranteed to be. This allows relative branches to be expressed in
terms of an index which is locally referenced at execution. Each node in the system is
initialised with a minimal binary containing the process creation kernel. The complete
program is loaded on node 0, from where parts of it can be copied onto other nodes to be
executed.

2.1. Protocol 

The process creation mechanism is implemented as a point-to-point protocol between a guest 
core and a host core. Any running thread is able to spawn the execution of a process on any 
other core. It consists of the following four phases. 

2.1.1. Connection Initialisation 

A guest initiates a connection by sending a single-byte control token and a word identifying
itself. It waits for an acknowledgment from the host indicating that a host thread has been
allocated and the connection is properly established. A core may host multiple guest
computations, each on a different thread.

2.1.2. Transmission of Closure 

$C(P)$ is transmitted in three parts. Firstly, a header is sent containing $|A|$ and $|Q|$.
Secondly, each $a \in A$ is sent with a single-word header denoting the type of the argument.
For referenced arrays, this is followed by length($a$) and the values contained. The host
writes these directly into heap-allocated space and the argument value is set to this address.
Single-value variables are treated similarly and constant values can be copied directly into
the argument value. Lastly, each $P \in Q$ is sent with a two-word header denoting $I_P$ and
length($P$) in bytes. The host allocates space on the heap and receives the instructions of
$P$ from the guest, read from memory in word-chunks from jump[$I_P$] to
jump[$I_P$] + length($P$). On completion, the host sets jump[$I_P$] to the address of $P$ on
the heap.
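
To make the three-part layout concrete, the sketch below flattens a closure into the sequence of words a guest might send. It is illustrative only: the type tag values, the class names and `encode_closure` are assumptions, since the paper does not give the concrete encoding, and single-value variables are simply sent as one word here.

```python
from dataclasses import dataclass

# Hypothetical type tags; the concrete values are not given in the paper.
TYPE_CONST, TYPE_VAR, TYPE_ARRAY = 0, 1, 2
WORD_BYTES = 4  # a word is taken to be 4 bytes


@dataclass
class Argument:
    kind: int          # one of the TYPE_* tags
    values: list[int]  # one word for constants and variables, many for arrays


@dataclass
class Procedure:
    index: int         # I_P, the jump-table index shared by all cores
    code: list[int]    # instruction words


def encode_closure(args: list[Argument], procs: list[Procedure]) -> list[int]:
    """Flatten C(P) = (A, I, Q) into a word stream: header, arguments, procedures."""
    words = [len(args), len(procs)]                # header: |A| and |Q|
    for arg in args:
        words.append(arg.kind)                     # single-word type header
        if arg.kind == TYPE_ARRAY:
            words.append(len(arg.values))          # length(a) precedes array contents
        words.extend(arg.values)
    for proc in procs:
        words.append(proc.index)                   # two-word header: I_P ...
        words.append(len(proc.code) * WORD_BYTES)  # ... and length(P) in bytes
        words.extend(proc.code)
    return words


if __name__ == "__main__":
    closure = encode_closure(
        [Argument(TYPE_CONST, [7]), Argument(TYPE_ARRAY, [1, 2, 3])],
        [Procedure(index=5, code=[0xAA, 0xBB])],
    )
    print(closure)
```

A host would read the same stream in order, allocating heap space for each array and procedure before updating its own jump table entry.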

2.1.3. Execution/Wait for Completion 

Once C has been successfully transmitted, the host initialises the thread's registers and stack 
with the arguments of P and initiates execution. The connection is left open and the guest 
thread waits for the host to indicate P has halted. 

2.1.4. Transmission of Results and Teardown 

Once P has halted, all referenced array and variable arguments contained in C (now the 
results) are transmitted back to the guest. The guest writes them back directly to their original 
locations. Once this has been completed, the connection is terminated. The guest continues 
execution and the host thread frees the memory allocated to the closure and yields. 

2.2. Performance Model 

The runtime cost of this mechanism is captured in the following way: 

Definition 3. The runtime of process creation $T_c$ is a function of the total size of the
argument values $n$, procedure descriptions $m$ and the results $o$ and is given by

    $T_c(n, m, o) = (C_i + C_w n + C_w m + C_w o) \cdot C_l$

where $C_i$ and $C_w$ are constants relating to initialisation and termination, and to the
overhead per (word) value transmitted, respectively. The value $n$ is inclusive of the size of
referenced arrays and hence $o \le n$. As all communication is synchronised, $C_l$ is a
constant factor overhead relating to the latency of the path between the guest and host
processors.

Normalising $C_l = 1$ to a single hop off-chip, the per-word overhead $C_w$ was measured as
150ns. The initialisation overhead $C_i$ is dependent on the size of the closure.

    proc distribute (t, n) is
      if n = 1 then node (t)
      else
        { distribute (t, n/2)
        | on t + n/2 do distribute (t + n/2, n/2) }

Figure 1. A recursive process distribute to rapidly distribute another process node over a set of processors.
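
The cost model of Definition 3 is simple enough to evaluate directly. The sketch below is a hedged transcription, with sizes in words and times in nanoseconds; the function name `creation_time_ns` is hypothetical, and the initialisation overhead is left as a parameter because it depends on the closure.

```python
C_W_NS = 150.0  # per-word overhead reported in the paper, in nanoseconds


def creation_time_ns(n: int, m: int, o: int, c_i: float, c_l: float = 1.0) -> float:
    """Process creation time: (C_i + C_w*n + C_w*m + C_w*o) * C_l, sizes in words."""
    if o > n:
        raise ValueError("the results cannot be larger than the arguments (o <= n)")
    return (c_i + C_W_NS * (n + m + o)) * c_l


if __name__ == "__main__":
    # Example values only: 16 argument words, 16 result words, and the procedure
    # transfer folded into an assumed initialisation overhead of 28 microseconds.
    print(creation_time_ns(n=16, m=0, o=16, c_i=28_000.0), "ns")
```

Doubling `c_l` models a path with twice the normalised latency, which scales the whole cost because every transfer is synchronised.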

3. Demonstration and Evaluation 

The aim of this section is to demonstrate the use of process creation combined with paral- 
lel recursion to evaluate the performance of the design and its implementation in realising 
efficient growth. To do this, we develop performance models to combine with experimental 
results, allowing us to extrapolate to larger systems and inputs. We start with a simple algo- 
rithm to demonstrate the fast distribution of parallel computations and then show how this 
can be applied to a practical problem. 

3.1. Rapid Process Distribution 

The algorithm distribute given in Figure 1 is inspired by [1] and works by spawning a new
copy of itself on a remote processor each time it recurses. Each process then itself recurses,
continuing this behaviour and hence, each level of the recursion subdivides the set of
processors in half, resulting in a doubling of the capacity to initiate computations. This
growth follows the structure of a binary tree. When each instance of distribute executes with
n = 1, the node process is executed and the recursion halted. The parameter t indicates the
node identifier and the algorithm is executed from node 0 with t = 0 and n = p.
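
The placement pattern produced by this recursion can be traced with a short sequential simulation. This is a sketch for illustration, not the paper's implementation: it only records which node would create which other node at each level, and the tuple layout is an assumption.

```python
def distribute(t: int, n: int, level: int = 0, spawns=None):
    """Record (level, parent, child) for every `on` executed by distribute(t, n)."""
    if spawns is None:
        spawns = []
    if n > 1:
        child = t + n // 2
        spawns.append((level, t, child))              # on t + n/2 do distribute(...)
        distribute(t, n // 2, level + 1, spawns)      # branch kept on the local node
        distribute(child, n // 2, level + 1, spawns)  # branch offloaded to the child
    return spawns


if __name__ == "__main__":
    for level, parent, child in sorted(distribute(0, 8)):
        print(f"level {level}: node {parent} creates node {child}")
```

For p = 8 the trace shows one creation at the first level, two at the second and four at the third, so the number of active nodes doubles at every level.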

3.1.1. Runtime 

The hypercube interconnection topology of the XK-XMP-64 provides an optimal transport in
terms of hop distance between remote creations; this is established by the following theorem.

Theorem 1. Every copy of distribute is always created on a neighbouring node when executed
on a hypercube.

Proof. Let $H = (N, E)$ be a $d$-dimensional hypercube. When distribute is executed with
$t = 0$ and $n = |N|$, starting at node 0 on $H$, the recursion follows the structure of a
binary tree of depth $d = \log |N|$, where identifiers at level $i$ are multiples of
$|N|/2^i$. A node $p$ at depth $i$ with identifier $k|N|/2^i$ creates a new remote child node
$c$ with identifier $k|N|/2^i + |N|/2^{i+1}$. As $|N| = 2^d$, $c = k 2^{d-i} + 2^{d-i-1}$ and
hence, $p \oplus c = 2^{d-i-1}$. □
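
As a worked instance of the proof, take an illustrative case with $d = 3$ (eight nodes), depth $i = 1$ and $k = 1$; these values are chosen here only as an example.

```latex
\begin{align*}
p &= k\,2^{d-i} = 1 \cdot 2^{2} = 4 = 100_2, \\
c &= p + 2^{d-i-1} = 4 + 2 = 6 = 110_2, \\
p \oplus c &= 100_2 \oplus 110_2 = 010_2 = 2^{1} = 2^{d-i-1}.
\end{align*}
```

The identifiers differ in exactly one bit, so by Definition 1 nodes 4 and 6 are joined by an edge.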



J. Hanlon and S. J. Hollis / Fast Distributed Process Creation with the XMOS XS1 Architecture

Figure 2. Measured execution time of distribute over varying numbers of processors.
(a) Measured vs. predicted (★) execution time, plotted against the number of processors p.
(b) Execution times for each level of recursion of distribute, plotted against the level.
(b) clearly shows the inter- vs. intra-chip latencies.

Given that $m$ and $n$ are fixed, that $o = 0$ (there are no results) and that, from
Theorem 1, we can normalise $C_l$ to 1, the runtime $T_c(n, m, o)$ of the on statement in
distribute is $O(1)$, which we define as the initialisation overhead $C_i$. Using this, we can
express the parallel runtime of distribute on $p$ processors. In each step, the number of
active processes doubles, but we count the runtime at each level of recursion, which
terminates when $n/2^i = 1$ or $i = \log n$. Hence,

    $T_d(p) = \sum_{i=1}^{\log p} (C_i + C_o) = (C_i + C_o) \log p$    (1)

where $C_o$ is the sequential overhead at each level. $C_i$ was measured as 18.4μs and $C_o$
was measured as 60ns.
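
Equation 1 can be evaluated directly from the two measured constants. The following is a small sketch in microseconds; the function name `distribute_time_us` is hypothetical.

```python
import math

C_I_US = 18.4  # initialisation overhead for distribute, microseconds (from the paper)
C_O_US = 0.06  # sequential overhead per level, microseconds (60 ns, from the paper)


def distribute_time_us(p: int) -> float:
    """Equation 1: (C_i + C_o) * log2(p), for p a positive power of two."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a positive power of two")
    return (C_I_US + C_O_US) * math.log2(p)


if __name__ == "__main__":
    for p in (2, 8, 64):
        print(f"p = {p:3d}: {distribute_time_us(p):7.2f} us")
```

With these constants the model gives about 110.8μs for p = 64, and each doubling of p adds one further level of 18.46μs.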

3.1.2. Results 



Figure 2a gives the predicted and measured execution time of distribute as a function of the
number of processors. The measured runtime almost exactly matches the prediction given by
Equation 1. Figure 2b shows the inaccuracy between the measured and predicted results more
clearly, by giving the measured execution time for each level in the recursion, that is, the
difference between consecutive points in Figure 2a. It shows that the assumption made based
on Theorem 1 does not hold exactly and that the first two levels take fractionally less time
than the last four levels (3.85μs). This is due to the reduced on-chip communication costs.
Overall though, each level of recursion completes on average in 18.9μs and it takes only
114.60μs to populate all 64 processors. Moreover, using the performance model given by
$T_d$, we can extrapolate to larger $p$ than is possible to measure with the current platform.
For example, when $p = 1024$, $T_d(1024) = 190$μs.

3.1.3. Remarks 

By using the performance model to make predictions, we have assumed a hypercube topology
and efficient support for concurrency. Although other architectures and larger systems cannot
make such provisions, the model and results provide a reasonable lower bound on execution
time with respect to the approach described.

The hypercube has rich communication properties and supports exponential growth, but
it does not scale well due to the number of connections at each node and length of wires in
realistic packagings. Although distribute has optimal single-hop behaviour and we obtain peak
performance, it is well known that efficient embeddings of binary trees into lower-degree
networks such as meshes and tori exist [14], allowing reasonable dispersion. In this case,
the granularity of process creation would have to be chosen to match the capabilities of the
architecture.

Provision of efficient ISA-level operations for processes and communications allows
fine-grained performance, particularly in terms of short messages. Many current architectures
do not support these operations at such a low level and cannot exploit the full potential of
this approach, although again it generalises at a coarser granularity of message size to match
the relative performance of these operations.



3.2. Mergesort

Mergesort is a well known sorting algorithm [15] that works by recursively halving a list
of unsorted numbers until unit sub-lists are obtained. These are then successively merged
together such that each merging step produces a sorted sub-list, which can be performed
in time $\Theta(n)$ for sub-lists of size $n/2$. Figure 3a gives the sequential mergesort
algorithm seq-msort.

Mergesort's branching recursive structure matches that of distribute, allowing us to
combine them to obtain a parallel version. Instead of sequentially evaluating the recursive
calls, conditional on some threshold value $C_{th}$, a local recursive call is made in
parallel with the second call, which is migrated to a remote core. This threshold is used to
control the extent to which the computation is distributed. In each of the experiments, for an
input of size $2^k$ and $p = 2^d$ available processors, the threshold is set as $2^k/p$. The
approach taken in distribute is used to control the placements of each of the
sub-computations. Initially, the problem is split in half; this will have the greatest benefit
to the execution time. Depending on the problem size, further remote branchings of the problem
may not be economical, and the remaining steps should be evaluated locally, in sequence. In
this case, the algorithm simply reduces to seq-msort.

This parallel formulation of mergesort is essentially just distribute with additional work
and communication overhead, but it will allow us to more concretely quantify the relative
costs of process creation. The parallel implementation of mergesort, par-msort, is given in
Figure 3b. It uses the same sequential merge procedure and the parameters t and n control the
placement of processes in the same way as they were used with distribute.
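
The thresholded fork described above can be tried out on a single machine. The sketch below is an analogue under stated assumptions, not the paper's par-msort: a thread stands in for the remote core, lists are copied instead of aliased, and the placement parameters t and n are omitted.

```python
import threading


def merge(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists in linear time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def par_msort(xs: list[int], threshold: int) -> list[int]:
    """Sort xs, forking the second half onto a new thread while len(xs) > threshold."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    a, b = xs[:mid], xs[mid:]
    if len(xs) > threshold:
        result: dict[str, list[int]] = {}
        worker = threading.Thread(
            target=lambda: result.update(b=par_msort(b, threshold)))
        worker.start()                      # stands in for the offloaded second call
        sorted_a = par_msort(a, threshold)  # the local call runs concurrently
        worker.join()                       # fork-join: wait for the offloaded half
        sorted_b = result["b"]
    else:
        sorted_a = par_msort(a, threshold)  # below the threshold: sequential recursion
        sorted_b = par_msort(b, threshold)
    return merge(sorted_a, sorted_b)


if __name__ == "__main__":
    data = [5, 3, 8, 1, 9, 2, 7, 4]
    p = 4  # assumed number of available workers
    print(par_msort(data, threshold=len(data) // p))
```

With an input of eight items and p = 4, the threshold is two, so forks happen at sizes eight and four and three worker threads are created in total, one fewer than p.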

We can now analyse the performance and behaviour of par-msort and the process creation 
mechanism by looking at the parallel runtime. 



3.2.1. Runtime 

We first define the runtime of the sequential components of par-msort. This includes the
sequential merging and sorting procedures. The runtime $T_m$ of merge is linear and is defined
as

    $T_m(n) = C_a n + C_b$






    proc par-msort (t, n, A) is
      if |A| > 1 then
      { a aliases A[0...|A|/2-1]
      ; b aliases A[|A|/2...|A|-1]
      ; if |A| > C_th then
        { par-msort (t, n/2, a)
        | on t + n/2 do
            par-msort (t + n/2, n/2, b) }
        else
        { par-msort (t, n/2, a)
        ; par-msort (t + n/2, n/2, b) }
      ; merge (A, a, b)
      }

(b)

Figure 3. Sequential and parallel mergesort processes.

for constants $C_a, C_b > 0$, relating to the per-word and per-merge overheads respectively.
These were measured as $C_a = 90$ns and $C_b = 830$ns. The runtime $T_s(n, 1)$ of seq-msort is
expressed as a recurrence:

    $T_s(n, 1) = 2\,T_s(n/2, 1) + T_m(n)$    (2)

which has the solution

    $T_s(n, 1) = n (C_c \log n + C_d)$    (3)

for constants $C_c, C_d > 0$. These were measured as $C_c = 200$ns and $C_d = 1200$ns. Based
on this we can express the runtime of par-msort as the combination of the costs of creating new
processes, moving data, merging and sorting sequentially. The key component of this is the
cost $T_c$, relating to the on statement in the parallel formulation, which is defined as

    $T_c(n) = C_i + 2 C_w n$.

This is because we can normalise $C_l$ to 1 (due to Theorem 1), the size of the procedures
sent is constant and the number of arguments and results are both $n$. The initialisation
overhead $C_i$ was measured as 28μs, larger than that for distribute as the closure contains
the descriptions of merge and par-msort. For the parallel runtime, the base sequential case is
given by Equation 2. With two processors, the work and execution time can be split in half at
the cost of migrating the procedures and data:

    $T_s(n, 2) = T_c(n/2) + T_s(n/2, 1) + T_m(n)$.

With four processors, the work is split in half at a cost of $T_c(n/2)$ and then in quarters
at a cost of $T_c(n/4)$. After the data has been sequentially sorted in time $T_s(n/4, 1)$ it
must be merged at the two children of the master node in time $T_m(n/2)$, and then again at
the master in time $T_m(n)$:

    $T_s(n, 4) = T_c(n/2) + T_c(n/4) + T_m(n/2) + T_m(n) + T_s(n/4, 1)$.



    proc seq-msort (A) is
      if |A| > 1 then
      { a aliases A[0...|A|/2-1]
      ; b aliases A[|A|/2...|A|-1]
      ; seq-msort (a)
      ; seq-msort (b)
      ; merge (A, a, b)
      }

(a)

Hence in general, we have:

    $T_s(n, p) = \sum_{i=1}^{\log p} \left( T_c(n/2^i) + T_m(n/2^{i-1}) \right) + T_s(n/p, 1)$

for $n \ge p$ as each leaf sub-process of the sorting computation must operate on at least one
data item. We can then express this precisely by substituting our definitions for $T_c$,
$T_m$ and $T_s$ and simplifying:

    $T_s(n, p) = C_w \frac{2n}{p}(p-1) + C_i \log p + C_a \frac{2n}{p}(p-1) + C_b \log p + \frac{n}{p}\left(C_c \log\frac{n}{p} + C_d\right)$

    $\phantom{T_s(n, p)} = \frac{2n}{p}(p-1)(C_w + C_a) + (C_i + C_b)\log p + \frac{n}{p}\left(C_c \log\frac{n}{p} + C_d\right)$    (4)

For $p = 1$, this reduces to Equation 3. This definition allows us to express a lower bound
and minimum for the runtime.
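
The closed form of Equation 4 can be cross-checked numerically against the level-by-level sum that generalises the two- and four-processor cases, adding one creation cost and one merge cost per level plus the sequential sort at the leaves. The sketch below uses the constants reported in the text, with n in words and times in nanoseconds; the function names are hypothetical.

```python
import math

# Constants reported in the paper, in nanoseconds.
C_A, C_B = 90.0, 830.0      # merge: per word and per merge
C_C, C_D = 200.0, 1200.0    # sequential sort
C_W, C_I = 150.0, 28_000.0  # process creation: per word and initialisation


def t_merge(n: float) -> float:
    return C_A * n + C_B


def t_seq(n: float) -> float:
    return n * (C_C * math.log2(n) + C_D)


def t_create(n: float) -> float:
    return C_I + 2 * C_W * n


def t_par_sum(n: int, p: int) -> float:
    """Per-level creation and merge costs plus the sequential sort at the leaves."""
    levels = int(math.log2(p))
    overhead = sum(t_create(n / 2 ** i) + t_merge(n / 2 ** (i - 1))
                   for i in range(1, levels + 1))
    return overhead + t_seq(n / p)


def t_par_closed(n: int, p: int) -> float:
    """Closed form of Equation 4."""
    return ((2 * n / p) * (p - 1) * (C_W + C_A)
            + (C_I + C_B) * math.log2(p)
            + (n / p) * (C_C * math.log2(n / p) + C_D))


if __name__ == "__main__":
    for n, p in [(1024, 1), (1024, 64), (8192, 16)]:
        print(n, p, t_par_sum(n, p), t_par_closed(n, p))
```

The two functions agree to floating-point precision, and for p = 1 both reduce to the sequential runtime of Equation 3.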

3.2.2. Lower Bound 

We can give a lower bound $T_s^{\min}$ on the parallel runtime $T_s(n, p)$ such that
$\forall n, p$

    $T_s(n, p) \ge T_s^{\min}(n, p)$.

This is obtained by considering the parallel overhead, that is, the cost of distributing the
problem over the system. In this case it relates to the cost of process creation, including
moving processes and their data, the component $T_c$:

    $T_s^{\min}(n, p) = \sum_{k=1}^{\log p} T_c\left(\frac{n}{2^k}\right) = C_i \log p + C_w \frac{2n}{p}(p-1)$.    (5)

Equation 5 is then the sum of the costs of process creation and movement of input data.
When $n = 0$, $T_s^{\min}$ relates to Equation 1; this is the cost of transmitting and
initiating just the computations over the system. For $n > 0$, this includes the cost of
moving the data.
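
The closed form in Equation 5 follows from the definition $T_c(n) = C_i + 2 C_w n$ and a geometric series over the $\log p$ levels; the intermediate steps, written out here for reference, are:

```latex
\begin{align*}
\sum_{k=1}^{\log p} T_c\!\left(\frac{n}{2^k}\right)
  &= \sum_{k=1}^{\log p} \left( C_i + \frac{2 C_w n}{2^k} \right) \\
  &= C_i \log p + 2 C_w n \sum_{k=1}^{\log p} 2^{-k} \\
  &= C_i \log p + 2 C_w n \left( 1 - \frac{1}{p} \right) \\
  &= C_i \log p + C_w \frac{2n}{p}(p-1).
\end{align*}
```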

3.2.3. Minimum 

Given an input of length $m \le n$ for some sub-computation of par-msort, creation of a remote
branch is beneficial only when the cost of this is less than the local sequential case:

    $T_c(m/2) + T_s(m/2, 1) + T_m(m) < T_s(m, 1)$

    $T_c(m/2) + T_s(m/2, 1) + T_m(m) < 2\,T_s(m/2, 1) + T_m(m)$

    $T_c(m/2) < T_s(m/2, 1)$

Hence, initiation of a remote sorting process for an array of length $n$ is beneficial only
when

    $T_c(n) < T_s(n, 1)$.

That is, the cost of remotely initiating a process to perform half the work and receiving the
results is less than the cost of sequentially sorting $m/2$ elements. Therefore at the
inflection point we have

    $T_c(n) = T_s(n, 1)$.    (6)
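
The break-even size in Equation 6 can be located numerically from the constants reported in the text. The sketch below bisects on the difference between the sequential sorting time and the creation time, with n in words and times in nanoseconds; the function names and the search interval are assumptions made for illustration.

```python
import math

C_C, C_D = 200.0, 1200.0    # sequential sort constants (ns), from the paper
C_W, C_I = 150.0, 28_000.0  # per-word and initialisation overheads (ns), from the paper


def t_create(n: float) -> float:
    return C_I + 2 * C_W * n


def t_seq(n: float) -> float:
    return n * (C_C * math.log2(n) + C_D)


def break_even_words(lo: float = 2.0, hi: float = 1024.0) -> float:
    """Bisect for the n at which the creation cost equals the sequential sorting cost."""
    def gap(n: float) -> float:
        return t_seq(n) - t_create(n)  # negative while offloading does not pay off

    assert gap(lo) < 0 < gap(hi), "the search interval must bracket the break-even point"
    for _ in range(60):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    n = break_even_words()
    print(f"break-even at about {n:.1f} words ({4 * n:.0f} bytes)")
```

With these constants the break-even point falls between 16 and 17 words, that is, at roughly 64 bytes.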

3.2.4. Results 

Figure 4 shows the measured execution time of par-msort as a function of the number of
processors used for varying input sizes. Figure 4a shows just three small inputs. The smallest
possible input is 256 bytes as the minimum size for any sub-computation is 1 word. The
minimum execution time for this size is at p = 4 processors, when the array is subdivided
twice into 64 byte sections. This is the point given by Equation 6 and indicates directly the
total cost incurred in offloading a computation. For p < 4, the cost of sorting sequentially
dominates the runtime, and for p > 4, the cost of creating new processes and transferring
the array sections dominates the runtime. With the next input of size 512 bytes, the minimum
moves to p = 8, where the array is again divided into 64 byte sections. This holds for each
input size and in general gives us the minimum size for which creating a new process will
further reduce the runtime.

The runtime lower bound $T_s^{\min}(0, p)$ given by Equation 5 is also plotted on Figure 4a.
This shows the small and sub-linear cost with respect to $p$ of the overheads incurred with
the distribution and management of processes around the system. Relative to $T_s(64, p)$ this
constitutes most of the overall work performed, which is expected as the array is fully
decomposed into unit sections. For larger sized inputs, as presented in Figure 4b, this cost
becomes just a fraction of the total work performed.

Figure 5 shows predicted execution times for par-msort for larger $p$ and $n$. Each plot
contains the execution time $T_s$ as defined by Equation 4 and $T_s^{\min}$ with and without
the transfer of data. Figure 5a gives results for the smallest input size possible to sort on
1024 cores (4kB) and includes the measurements for $T_s^{\min}(0, p)$ and $T_s$. It reiterates
what was shown in Figure 4a and shows that beyond 64 cores, very little penalty is incurred to
create up to 1024 sorting instances, with $T_s^{\min}$ accounting for around 23% of the total
runtime for larger systems. This is due to the exponential growth of the distribution
mechanism. Figure 5b gives results for the largest measured input of 32kB, showing the same
trends, where $T_s^{\min}$ this time is around just 3% of the runtime between 64 and 1024
cores.



Figure 5c and Figure 5d present predictions made by the performance model for more
realistic workloads of 10MB and 1GB respectively. Figure 5c shows that 10MB could be
sorted sequentially in around 7s and in parallel in at least 0.6s. Figure 5d shows that 1GB
could be sorted in just under 15m sequentially or at least 1m in parallel. What these results
make clear is that the distribution of the input data dominates and bounds the runtime and
that the distribution of data constituting the process descriptions is a negligible proportion
of the overall runtime for reasonable workloads. The relatively small sequential workload
$O((n/p) \log(n/p))$ of mergesort, which decays quickly as $p$ increases, emphasises the cost of
data distribution. For heavier workloads, such as 0{{n/p)^), we would expect to see a much 
more dramatic reduction in execution time and the cost of data distribution still eventually to 
bound runtime, but then by a relatively fractional amount. 



12 



J. Hanlon and S. J. Hollis / Fast Distributed Process Creation with the XMOS XSl Architecture 



Tf{Q,p) 
r,(256B,/?) 
U5l2Q,p) 




16 32 



P 



64 



(a) Log-linear plot for varying small inputs. 




r,(256B,p) 

T,{lkB,p) 
U2kB,p) 
T,{4kB,p) 
mkB,p) 
U16kB,p) 
T,{32kB,p) 



8 16 32 64 
P 

(b) Log-log plot for larger inputs. 



Figure 4. Measured execution time of par-msort as a function of the number of processors, (a) highlights the 
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Figure 5. Predicted (★) performance of par-msort for larger n and p < 1024. All plots are log-log. 
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4. Conclusions 

This paper presents the design, implementation, demonstration and evaluation of an efficient
mechanism for dynamically creating computations in a distributed memory parallel
computer. It has shown that a computation can be dispatched to a remote processor in just tens
of microseconds, and when this mechanism is combined with recursion, it can be used to
efficiently implement parallel growth.

The distribute algorithm demonstrates how an empty array of processors can be populated
with a computation exponentially quickly. For 64 cores, it takes just 114.60μs and for 1024
cores this will be of the order of 190μs. The par-msort algorithm extends this by performing
additional computational work and communication of data, which allowed us to obtain a
clearer picture of the cost of process creation with respect to varying problem sizes. As the
cost of transferring and invoking remote computations is related primarily to the size of the
closure, this cost grows slowly with system size and is independent of data. With a 10MB
input, it represents around just 0.001% of the runtime.

The sorting results also highlight two important issues: the granularity at which it is
possible to create new processes and the costs of data movement. They show that the
computation can be subdivided to operate on just 64 byte chunks and still improve performance.
The cost of data movement is significant, relative to the small amount of work performed
at each node; for more intensive tasks, these costs would diminish. However, these results
assume a worst case, where all data originates from a single core. In other systems, this
cost may be reduced by concurrent access through a parallel file system or from prior data
distribution.

The XS1 architecture provides efficient support for concurrency and communications
and the XK-XMP-64 provides an optimal transport for the described algorithms, so we expect
our lightweight scheme to be fast, relative to the performance of other distributed systems.
Hence, the results provide a convincing proof-of-concept implementation, demonstrating the
kind of performance that is possible and, with respect to the topology, establish a reasonable
lower bound on the performance of the approach presented. The results generalise to more
dynamic schemes where placements are not perfect and to other larger architectures such as
supercomputers, where interconnection topologies are less well connected and
communication is less efficient. In these cases, the approach applies at a coarser granularity
with larger problem sizes to match the relative performance.



5. Future Work 

Having successfully designed and implemented a language and runtime allowing explicit
process creation with the on statement, we will continue with our focus on the concept of
growth in parallel programs and plan to extend the work in the following ways. Firstly, by
looking at how placement of process closures can be determined automatically by the
runtime, relieving the programmer of having to specify this. Secondly, by implementing the
language and runtime with C and MPI to target a larger platform, which will provide a more
scalable demonstration of the concepts and their generality. And lastly, by looking at generic
optimisations that can be made to the process creation mechanism to improve overall
performance and scalability. More details about the current implementation are available
online at http://www.cs.bris.ac.uk/~hanlon/sire, where news of future developments will also
be published.